ABSTRACT


In a data distribution scenario the sensitive data given to agents can be leaked in some cases and can be found in unauthorized places. 
Our aim is to detect when the distributor’s sensitive data have been leaked by agents and if possible to identify the agent who leaked the data. We 
consider the addition of fake objects to the distributed set which do not correspond to real entities but appear realistic to the agents. The 
distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by 
other means. We also present data allocation strategies and algorithms for distributing objects to agents, in a way that improves our chances of 
identifying a leaker. Our main idea is to prevent the agents from comparing their data with one another to identify fake objects. A Symmetric 
Inference Model (SIM) is used here to find out the probability of identifying dependency among the data distributed to various agents. Using this 
technique a symmetric inference graph (SIG) is drawn denoting the links among data sets. 


I. INTRODUCTION 


An association rule is defined as the implication
X -> Y, described by two interestingness measures,
support and confidence, where X and Y are sets of
items and X ∩ Y = ∅. Apriori was the first algorithm
proposed in the association rule mining field, and
many other algorithms were derived from it. Starting
from a database, it extracts all association rules
that satisfy minimum thresholds of support and
confidence.
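
The two measures can be made concrete with a minimal sketch. It assumes the standard definitions: support is the fraction of transactions containing an itemset, and confidence of X -> Y is support(X ∪ Y) / support(X). The function names and the toy transactions are illustrative assumptions, not taken from the dataset described later.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X ∪ Y) / support(X); returns 0.0 when X never occurs."""
    support_x = support(x, transactions)
    if support_x == 0.0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


# Hypothetical toy transactions (illustration only).
transactions = [
    frozenset({"Lays", "Pepsi", "Dairymilk"}),
    frozenset({"Lays", "Pepsi"}),
    frozenset({"Lays", "Nivea"}),
    frozenset({"Pepsi", "Savlon"}),
]

rule_support = support({"Lays", "Pepsi"}, transactions)
rule_confidence = confidence({"Lays"}, {"Pepsi"}, transactions)
```

For the rule Lays -> Pepsi in this toy data, two of four transactions contain both items, and two of the three transactions containing Lays also contain Pepsi.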

It is well known that mining algorithms can
discover a prohibitive number of association rules;
for instance, thousands of rules are extracted from a
database of several dozen attributes and several
hundred transactions.

Furthermore, valuable information is often
represented by rare (low-support) and unexpected
association rules, which are surprising to the user.
The higher the support threshold, the more efficient
the algorithms are, but the more obvious the
discovered rules become, and hence the less
interesting they are for the user. As a result, it is
necessary to set the support threshold low enough to
extract valuable information.

Unfortunately, the lower the support is, the larger the 
volume of rules becomes, making it intractable for a 
decision-maker to analyze the mining result. 
Experiments show that rules become almost 
impossible to use when the number of rules 
exceeds 100. Thus, it is crucial to help the
decision-maker with an efficient technique for 
reducing the number of rules. 
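
The effect of the support threshold on the volume of results can be illustrated with a small level-wise search in the style of Apriori. This is a sketch under stated assumptions, not the algorithm as used by the authors: the function name `frequent_itemsets` and the toy transactions are hypothetical, and the implementation favours readability over efficiency.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: a (k+1)-itemset is only counted as a candidate
    if all of its k-item subsets were found frequent."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    result: Dict[FrozenSet[str], float] = {}
    current = [frozenset([i]) for i in items]
    while current:
        level = {}
        for cand in current:
            s = sum(1 for t in transactions if cand <= t) / n
            if s >= min_support:
                level[cand] = s
        result.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        next_cands = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in level for sub in combinations(union, len(a))
            ):
                next_cands.add(union)
        current = list(next_cands)
    return result


# Hypothetical toy transactions (illustration only).
transactions = [
    frozenset({"Lays", "Pepsi", "Dairymilk"}),
    frozenset({"Lays", "Pepsi"}),
    frozenset({"Lays", "Nivea"}),
    frozenset({"Pepsi", "Savlon"}),
]

high = frequent_itemsets(transactions, 0.5)   # stricter threshold
low = frequent_itemsets(transactions, 0.25)   # looser threshold
```

Even on four transactions, halving the threshold makes the number of frequent itemsets grow severalfold, and every additional frequent itemset can give rise to several rules.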

To overcome this drawback, several methods have been
proposed in the literature. On the one hand, different
algorithms were introduced to reduce the number of
itemsets by generating closed, maximal, or optimal
itemsets, along with several algorithms to reduce the
number of rules using nonredundant rules or
pruning techniques. On the other hand,
postprocessing methods can improve the selection of
discovered rules. Different complementary
postprocessing methods may be used, such as pruning,
summarizing, grouping, or visualization. Pruning
consists of removing uninteresting or redundant rules.
In summarizing, concise sets of rules are generated.
Groups of rules are produced in the grouping process,
and visualization improves the readability of a
large number of rules by using adapted graphical
representations. However, most existing
postprocessing methods are based on statistical
information in the database. Since rule
interestingness strongly depends on user knowledge
and goals, these methods do not guarantee that
interesting rules will be extracted.

For instance, if the user looks for unexpected rules,
all the already known rules should be pruned. Or, if
the user wants to focus on specific schemas of rules,
only this subset of rules should be selected.
Moreover, as has been suggested, rule postprocessing
methods should be based on strong interactivity
with the user. The representation of user
knowledge is an important issue: the more flexible,
expressive, and accurate the formalism in which the
knowledge is represented, the more efficient the
rule selection.


Motivation for the General Impression 
Improvement Using Ontologies: 


In the Semantic Web context, the number of available
ontologies has been increasing, covering a wide
range of application domains. This could be a great
advantage for ontology-based user knowledge
representation.
One of our most important contributions relies on
using ontologies as user background knowledge
representation. Thus, we extended the specification of:

• General Impressions (GI),

• Reasonably Precise Concepts (RPC), and

• Precise Knowledge (PK)


Supermarket Item Taxonomy (figure):

Items
    Soap: Savlon, Pears, Dove
    Cool Drinks: Fanta, Pepsi, Cocola
    Snacks: Cake, Chocolate


Ontology Description (figure):

Items
    Cosmetics: Fairever, Soap
    Food Items: Cooldrinks, Snacks

Concise Representations of Frequent Itemsets: 


Interestingness measures represent metrics in the 
process of capturing dependencies and implications 
between database items, and express the strength of 
the pattern association. 

Since frequent itemset generation is considered an
expensive operation, mining frequent closed itemsets
was proposed in order to reduce the number of
frequent itemsets. An itemset X is a closed
frequent itemset if there is no itemset X' ⊃ X
such that t(X) = t(X'), where t(X) denotes the set
of transactions containing X. Thus, the number of
frequent closed itemsets generated is reduced in
comparison with the number of frequent itemsets.
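
The closedness condition can be checked directly on a small example. The sketch below assumes that t(X) is the set of transactions containing X and that an itemset is closed when no proper superset has the same t. It enumerates itemsets by brute force, which is only suitable for toy data; the function names and transactions are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

TidSet = FrozenSet[int]


def tidset(itemset: FrozenSet[str],
           transactions: List[FrozenSet[str]]) -> TidSet:
    """t(X): indices of the transactions that contain every item of X."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def closed_frequent_itemsets(
    transactions: List[FrozenSet[str]], min_count: int
) -> Tuple[Dict[FrozenSet[str], TidSet], Dict[FrozenSet[str], TidSet]]:
    """Return (frequent, closed) itemsets, each mapped to its tidset."""
    items = sorted({i for t in transactions for i in t})
    frequent: Dict[FrozenSet[str], TidSet] = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            tx = tidset(x, transactions)
            if len(tx) >= min_count:
                frequent[x] = tx
    # X is closed if no proper superset X' has t(X') == t(X). Such a
    # superset would have the same support, so searching among the
    # frequent itemsets is sufficient.
    closed = {
        x: tx for x, tx in frequent.items()
        if not any(x < y and tx == ty for y, ty in frequent.items())
    }
    return frequent, closed


# Hypothetical toy transactions (illustration only).
transactions = [
    frozenset({"Lays", "Pepsi", "Dairymilk"}),
    frozenset({"Lays", "Pepsi"}),
    frozenset({"Lays", "Pepsi", "Nivea"}),
    frozenset({"Dairymilk"}),
]

frequent, closed = closed_frequent_itemsets(transactions, min_count=2)
```

Here {Lays} and {Pepsi} are frequent but not closed, because {Lays, Pepsi} occurs in exactly the same transactions; only {Lays, Pepsi} and {Dairymilk} are kept.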

The CLOSET algorithm was proposed as a new 
efficient method for mining closed itemsets. 
CLOSET uses a novel frequent pattern tree (FP-tree) 
structure, which is a compressed representation of all 
the transactions in the database. Moreover, it uses a 
recursive divide-and-conquer and database projection 
approach to mine long patterns. 
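
How an FP-tree compresses transactions can be sketched as follows. This is only the tree-construction step under common assumptions (infrequent items dropped, remaining items ordered by descending frequency so that transactions share prefixes); it is not an implementation of CLOSET itself, and it omits the header table and node links that a full FP-tree carries. All names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FPNode:
    item: Optional[str]                      # None for the root
    count: int = 0
    children: Dict[str, "FPNode"] = field(default_factory=dict)


def build_fp_tree(transactions: List[List[str]], min_count: int) -> FPNode:
    """Insert each transaction as a path of its frequent items."""
    freq = Counter(item for t in transactions for item in set(t))
    root = FPNode(item=None)
    for t in transactions:
        # Keep frequent items; order by descending frequency, then name,
        # so that common items sit near the root and paths overlap.
        ordered = sorted(
            (i for i in set(t) if freq[i] >= min_count),
            key=lambda i: (-freq[i], i),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item=item))
            node.count += 1
    return root


def count_nodes(node: FPNode) -> int:
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical toy transactions (illustration only).
transactions = [
    ["Lays", "Pepsi", "Dairymilk"],
    ["Lays", "Pepsi"],
    ["Lays", "Pepsi", "Nivea"],
    ["Dairymilk"],
]

tree = build_fp_tree(transactions, min_count=2)
```

The three transactions that contain Lays and Pepsi share a single path from the root, so the counts on that path, rather than repeated copies of the items, record how often the prefix occurs.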


Redundancy Reduction of Association Rules: 

To discover hidden correlations, association rule
mining methods use two important constraints known
as support and confidence. However, mining methods
are often unable to find the best values for these
constraints: a large number of rules is produced when
the thresholds are low, and very few rules when the
thresholds are high. In addition, regardless of these
thresholds, mining methods produce many
redundant rules, that is, rules with identical meaning.
Such redundant rules are a main
impediment to the efficient utilisation of discovered
rules, so they should be identified and
eliminated.

‘Redundancy reduction’ refers to a class of 
techniques specifically aimed at pruning out patterns 
that do not convey new information. They address the 
quantity problem and the associated understandability 
issue by succinct characterization of the domain. If a 
set of rules refer to the same feature of the data, then 
the most general rule may be retained. 
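
One simple reading of "retain the most general rule" can be sketched as follows. The assumption here, which is ours and not stated in the text, is that a rule is redundant when another rule has the same consequent and a strictly smaller antecedent. Support and confidence are ignored for brevity, and the rule list is hypothetical.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent)


def keep_most_general(rules: List[Rule]) -> List[Rule]:
    """Drop every rule whose antecedent is a strict superset of the
    antecedent of another rule with the same consequent."""
    kept = []
    for ante, cons in rules:
        dominated = any(
            other_cons == cons and other_ante < ante
            for other_ante, other_cons in rules
        )
        if not dominated:
            kept.append((ante, cons))
    return kept


# Hypothetical rules (illustration only).
rules = [
    (frozenset({"Lays"}), frozenset({"Pepsi"})),
    (frozenset({"Lays", "Nivea"}), frozenset({"Pepsi"})),
    (frozenset({"Lays", "Savlon"}), frozenset({"Pepsi"})),
    (frozenset({"Tiger", "Savlon"}), frozenset({"Cocola"})),
]

pruned = keep_most_general(rules)
```

The two rules that add Nivea or Savlon to the antecedent are dropped in favour of Lays -> Pepsi, while the unrelated rule is untouched. A practical variant would also compare confidence before discarding the more specific rule.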

‘Rule covers’ is a method that retains a subset of the
original set of rules. This subset refers to all rows (in 
a relational database) that the original rule set 
covered. Another strategy in AR mining is to 
determine a subset of frequently occurring closed 
itemsets from their supersets. Although the subset’s
cardinality is much lower than that of the superset, 
there is no loss of information. Sometimes, one rule 
can be generated from another using a certain 
inference system. Retaining only the basic rules may 
reduce the cardinality. In addition, the basic rules 
may provide users with a bird’s eye-view of the 
domain. Such inference systems can also recover the 
original rule set using reversible mechanisms. Thus, 
information content of the basic un-pruned set is 
retained. 


II. ABOUT DATASET


A data set is a named collection of data that contains 
individual data units organized (formatted) in a 
specific, IBM-prescribed way and accessed by a 
specific access method that is based on the data set 
organization. Types of data set organization include 
sequential, relative sequential, indexed sequential, 
and partitioned. 

Sales forecasting is a difficult area of management. 
Most managers believe they are good at forecasting. 
Sales forecasts can be based on three types of 
information: 

(1) What customers say about their intentions to
continue buying products in the industry

(2) What customers are actually doing in the market

(3) What customers have done in the past in the
market

There are many market research businesses that 
undertake surveys of customer intentions — and sell 
this information to businesses that need the data for 
sales forecasting purposes. The value of a customer 
intention survey increases when there are a relatively 
small number of customers, the cost of reaching them 
is small, and they have clear intentions. An 
alternative way of measuring customer intentions is 
to sample the opinions of the sales force or to consult 
industry experts. 
In sales forecasting, a supermarket has been taken as
the dataset: a collection of information about the
supermarket. The data miner analyses the items based
on the request queries of customers.

The fields taken from the supermarket are the name
of the customer, the number of items, and the names
of the items. Each transaction and each item is
identified by an ID. The association rules have the
form (X, Y -> Z), where X, Y, and Z are items: a
customer who has purchased items X and Y is also
likely to buy item Z. The rules created for the
transactions are then made available.

For example, in a supermarket, a customer who buys
cool drinks and soap may also buy snacks.


X                  Y          ->  Z

Good day biscuit   Savlon     ->  Pepsi
Lays               Fairever   ->  Cocola
Kurkure            Nivea      ->  Pepsi
Tiger              Dathiri    ->  Pepsi
hide&seek          Ponds      ->  Pepsi
Cookies            Hamamam    ->  Miranda
Dairymilk          Amla       ->  7Up
Savlon             Fanta      ->  Biscuits
Tiger              Savlon     ->  Cocola
hide&seek          Fairever   ->  Miranda
Cookies            Ponds      ->  Fanta
Cheetos            Dove       ->  7Up
Pepsi              Himalaya   ->  Lays
Tiger              Nivea      ->  Lays
Lays               Gini       ->  Liondates
Liondates          Lays       ->  Nivea
Milkybar           Cheetos    ->  Ponds
Parlig             Kurkure    ->  Vvdoil
Lays               Pepsi      ->  Dairymilk


III. DETAILED DESCRIPTION OF DATASET


Distribution Items:

Distributed items keep each item under a
different name, which indicates the repeated items.

Missed Items:

Missed items describe the missing values or items in
the dataset.
[Figure: Missed Counts]


Paired Items:

Paired items are used to predict the most
frequently repeated pairs of values in the dataset.
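
Finding the most frequently repeated pairs can be sketched with a simple co-occurrence count. This is a minimal illustration that assumes a pair is counted once per transaction in which both items appear; the function name and the toy transactions are hypothetical and do not reproduce the supermarket dataset.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def most_common_pairs(transactions: List[List[str]],
                      top_n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count how many transactions contain each unordered pair of items."""
    counts: Counter = Counter()
    for t in transactions:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts.most_common(top_n)


# Hypothetical toy transactions (illustration only).
transactions = [
    ["Lays", "Pepsi", "Dairymilk"],
    ["Lays", "Pepsi"],
    ["Tiger", "Savlon", "Cocola"],
    ["Lays", "Pepsi", "Nivea"],
]

top = most_common_pairs(transactions, top_n=1)
```

In this toy data the pair (Lays, Pepsi) occurs in three transactions and every other pair in one, so it is reported as the most repeated pair.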
IV. CONCLUSION


Redundancy reduction methods may not provide a 
holistic picture if the size of the pruned rule-set is 
large. Specific knowledge with respect to certain 
customers may be lost, especially if the process has a 
bias towards generalization. Hence, in e-commerce 
applications, such methods may help in the concise 
characterization of a population of customers but not 
in generating descriptions of individual customers. A 
subset of the customer population may be described 
in general terms with new customers being assigned 
to these population segments. A method arriving at 
generalizations might remove interesting exceptions. 
Thus the important issue of identification of 
interesting patterns is left unaddressed. 